Deep Epitomic Convolutional Neural Networks
Deep convolutional neural networks have recently proven extremely competitive
in challenging image recognition tasks. This paper proposes the epitomic
convolution as a new building block for deep neural networks. An epitomic
convolution layer replaces a pair of consecutive convolution and max-pooling
layers found in standard deep convolutional neural networks. The main version
of the proposed model uses mini-epitomes in place of filters and computes
responses invariant to small translations by epitomic search instead of
max-pooling over image positions. The topographic version of the proposed model
uses large epitomes to learn filter maps organized in translational
topographies. We show that error back-propagation can successfully learn
multiple epitomic layers in a supervised fashion. The effectiveness of the
proposed method is assessed in image classification tasks on standard
benchmarks. Our experiments on Imagenet indicate improved recognition
performance compared to standard convolutional neural networks of similar
architecture. Our models pre-trained on Imagenet perform excellently on
Caltech-101. We also obtain competitive image classification results on the
small-image MNIST and CIFAR-10 datasets.
Comment: 9 page
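The epitomic search described in this abstract can be illustrated with a minimal 1-D sketch in Python. It assumes that a mini-epitome is a filter slightly longer than the input patch and that the response is the maximum inner product over all sub-filters cut from it. The function name `epitomic_response` is hypothetical and is not taken from the paper.

```python
def epitomic_response(patch, epitome):
    """Return (best response, best shift) over all sub-filters of the epitome.

    Assumption: the epitome is a 1-D list at least as long as the patch, and
    every contiguous window of it with the patch's length acts as one filter.
    """
    width = len(patch)
    if len(epitome) < width:
        raise ValueError("epitome must be at least as long as the patch")
    best_value, best_shift = None, None
    for shift in range(len(epitome) - width + 1):
        # Inner product between the patch and the sub-filter at this shift.
        value = sum(p * w for p, w in zip(patch, epitome[shift:shift + width]))
        if best_value is None or value > best_value:
            best_value, best_shift = value, shift
    return best_value, best_shift


# A 3-tap patch matched against a 5-tap epitome (three candidate sub-filters).
print(epitomic_response([1, 0, -1], [0, 1, 0, -1, 0]))  # (2, 1)
```

The maximum is taken over positions inside the epitome rather than over image positions, which is the contrast with max-pooling that the abstract draws.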
Efficient variational inference in large-scale Bayesian compressed sensing
We study linear models under heavy-tailed priors from a probabilistic
viewpoint. Instead of computing a single sparse most probable (MAP) solution as
in standard deterministic approaches, the focus in the Bayesian compressed
sensing framework shifts towards capturing the full posterior distribution on
the latent variables, which allows quantifying the estimation uncertainty and
learning model parameters using maximum likelihood. The exact posterior
distribution under the sparse linear model is intractable and we concentrate on
variational Bayesian techniques to approximate it. Repeatedly computing
Gaussian variances turns out to be a key requisite and constitutes the main
computational bottleneck in applying variational techniques in large-scale
problems. We leverage the recently proposed Perturb-and-MAP algorithm for
drawing exact samples from Gaussian Markov random fields (GMRF). The main
technical contribution of our paper is to show that estimating Gaussian
variances using a relatively small number of such efficiently drawn random
samples is much more effective than alternative general-purpose variance
estimation techniques. By reducing the problem of variance estimation to
standard optimization primitives, the resulting variational algorithms are
fully scalable and parallelizable, allowing Bayesian computations in extremely
large-scale problems with the same memory and time complexity requirements as
conventional point estimation techniques. We illustrate these ideas with
experiments in image deblurring.
Comment: 8 pages, 3 figures, appears in Proc. IEEE Workshop on Information
Theory in Computer Vision and Pattern Recognition (in conjunction with
ICCV-11), Barcelona, Spain, Nov. 201
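The sampling-based variance estimate described in this abstract can be sketched in Python under stated assumptions. The sketch assumes a zero-mean Gaussian with precision matrix A = HᵀH/σ² + diag(1/γ), perturbs each quadratic factor with noise matched to its own variance, and solves for the mode of the perturbed model. A small dense solver stands in for the large-scale optimization primitives mentioned in the abstract. All names (`solve`, `precision`, `sample_variances`, `exact_variances`) are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def precision(H, noise_var, prior_var):
    """Assumed precision matrix A = H^T H / noise_var + diag(1 / prior_var)."""
    n = len(H[0])
    A = [[sum(row[i] * row[j] for row in H) / noise_var for j in range(n)]
         for i in range(n)]
    for i in range(n):
        A[i][i] += 1.0 / prior_var[i]
    return A


def sample_variances(H, noise_var, prior_var, num_samples, rng):
    """Estimate diag(A^{-1}) as the mean square of perturb-and-solve samples."""
    n = len(H[0])
    A = precision(H, noise_var, prior_var)
    acc = [0.0] * n
    for _ in range(num_samples):
        # Perturb the likelihood and prior factors independently.
        e_lik = [rng.gauss(0.0, 1.0) / noise_var ** 0.5 for _ in H]
        e_pri = [rng.gauss(0.0, 1.0) / prior_var[i] ** 0.5 for i in range(n)]
        rhs = [sum(H[k][i] * e_lik[k] for k in range(len(H))) + e_pri[i]
               for i in range(n)]
        x = solve(A, rhs)  # zero-mean sample whose covariance is A^{-1}
        for i in range(n):
            acc[i] += x[i] ** 2
    return [a / num_samples for a in acc]


def exact_variances(H, noise_var, prior_var):
    """Reference values: the diagonal of A^{-1}, computed column by column."""
    A = precision(H, noise_var, prior_var)
    n = len(A)
    return [solve(A, [1.0 if j == i else 0.0 for j in range(n)])[i] for i in range(n)]
```

Because the perturbed right-hand side has covariance A, each solution has covariance A⁻¹, so averaging squared samples gives an unbiased estimate of the marginal variances. In a large problem the dense solve would presumably be replaced by an iterative solver, which is an assumption of this sketch rather than a detail stated above.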
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
In this work we address the task of semantic image segmentation with Deep
Learning and make three main contributions that are experimentally shown to
have substantial practical merit. First, we highlight convolution with
upsampled filters, or 'atrous convolution', as a powerful tool in dense
prediction tasks. Atrous convolution allows us to explicitly control the
resolution at which feature responses are computed within Deep Convolutional
Neural Networks. It also allows us to effectively enlarge the field of view of
filters to incorporate larger context without increasing the number of
parameters or the amount of computation. Second, we propose atrous spatial
pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP
probes an incoming convolutional feature layer with filters at multiple
sampling rates and effective fields-of-views, thus capturing objects as well as
image context at multiple scales. Third, we improve the localization of object
boundaries by combining methods from DCNNs and probabilistic graphical models.
The commonly deployed combination of max-pooling and downsampling in DCNNs
achieves invariance but has a toll on localization accuracy. We overcome this
by combining the responses at the final DCNN layer with a fully connected
Conditional Random Field (CRF), which is shown both qualitatively and
quantitatively to improve localization performance. Our proposed "DeepLab"
system sets the new state-of-art at the PASCAL VOC-2012 semantic image
segmentation task, reaching 79.7% mIOU in the test set, and advances the
results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and
Cityscapes. All of our code is made publicly available online.
Comment: Accepted by TPAM
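The effect of the sampling rate in atrous convolution can be illustrated with a minimal 1-D sketch in Python. The sketch assumes valid-mode cross-correlation without padding. The names `atrous_conv1d` and `field_of_view` are hypothetical and are not taken from the DeepLab code.

```python
def field_of_view(kernel_size, rate):
    """Number of input samples spanned by a kernel whose taps are `rate` apart."""
    return (kernel_size - 1) * rate + 1


def atrous_conv1d(signal, kernel, rate=1):
    """Valid-mode 1-D cross-correlation with kernel taps spaced `rate` apart."""
    span = field_of_view(len(kernel), rate)
    out = []
    for start in range(len(signal) - span + 1):
        # Same weights for every rate; only the sampling positions change.
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
print(atrous_conv1d(signal, kernel, rate=1))  # [-2, -2, -2]
print(atrous_conv1d(signal, kernel, rate=2))  # [-4]
print(field_of_view(3, 2))                    # 5
```

With rate 2 the three weights cover five input samples, so the field of view grows while the number of parameters and multiplications per output stays the same.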
MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
In this work, we tackle the problem of instance segmentation, the task of
simultaneously solving object detection and semantic segmentation. Towards this
goal, we present a model, called MaskLab, which produces three outputs: box
detection, semantic segmentation, and direction prediction. Building on top of
the Faster-RCNN object detector, the predicted boxes provide accurate
localization of object instances. Within each region of interest, MaskLab
performs foreground/background segmentation by combining semantic and direction
prediction. Semantic segmentation assists the model in distinguishing between
objects of different semantic classes including background, while the direction
prediction, estimating each pixel's direction towards its corresponding center,
allows separating instances of the same semantic class. Moreover, we explore
the effect of incorporating recent successful methods from both segmentation
and detection (i.e. atrous convolution and hypercolumn). Our proposed model is
evaluated on the COCO instance segmentation benchmark and shows comparable
performance with other state-of-art models.
Comment: 10 pages including referenc
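The direction target mentioned in this abstract can be illustrated with a short Python sketch that quantizes each pixel's direction towards an instance center into angular bins. The use of 8 bins, the (row, column) coordinate convention, and the name `direction_bin` are assumptions made for illustration, not details stated above.

```python
import math


def direction_bin(pixel, center, num_bins=8):
    """Quantize the direction from `pixel` to `center` into one of `num_bins` bins.

    Both arguments are (row, col) pairs. Bin 0 starts at the positive column
    axis and bins advance with increasing atan2 angle.
    """
    dy = center[0] - pixel[0]
    dx = center[1] - pixel[1]
    angle = math.atan2(dy, dx) % (2.0 * math.pi)  # wrap into [0, 2*pi)
    return int(angle / (2.0 * math.pi / num_bins)) % num_bins


# Two instances of the same class: a pixel between them points in clearly
# different directions depending on which center it belongs to.
pixel = (0, 0)
print(direction_bin(pixel, (1, 5)))    # 0
print(direction_bin(pixel, (-3, -1)))  # 5
```

Pixels with the same semantic label but different direction bins relative to a box can then be told apart, which is the role the abstract assigns to direction prediction.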
Towards Accurate Multi-person Pose Estimation in the Wild
We propose a method for multi-person detection and 2-D pose estimation that
achieves state-of-art results on the challenging COCO keypoints task. It is a
simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of boxes which are
likely to contain people; for this we use the Faster RCNN detector. In the
second stage, we estimate the keypoints of the person potentially contained in
each proposed bounding box. For each keypoint type we predict dense heatmaps
and offsets using a fully convolutional ResNet. To combine these outputs we
introduce a novel aggregation procedure to obtain highly localized keypoint
predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression
(NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based
confidence score estimation, instead of box-level scoring.
Trained on COCO data alone, our final system achieves average precision of
0.649 on the COCO test-dev set and 0.643 on the test-standard set, outperforming
the winner of the 2016 COCO keypoints challenge and other recent state-of-art.
Further, by using additional in-house labeled data we obtain an even higher
average precision of 0.685 on the test-dev set and 0.673 on the test-standard
set, more than 5% absolute improvement compared to the previous best performing
method on the same dataset.
Comment: Paper describing an improved version of the G-RMI entry to the 2016
COCO keypoints challenge (http://image-net.org/challenges/ilsvrc+coco2016).
Camera ready version to appear in the Proceedings of CVPR 201
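One plausible way to fuse a dense heatmap with per-pixel offsets is a voting scheme, sketched below in Python. This is an illustration under assumptions and not necessarily the aggregation procedure of the paper: each pixel casts a vote, weighted by its heatmap value, at the location its offset points to, with nearest-pixel rounding. The names `aggregate_votes` and `argmax2d` are hypothetical.

```python
def aggregate_votes(heatmap, offsets):
    """Accumulate heatmap-weighted votes at the positions the offsets point to.

    heatmap[y][x] is a score in [0, 1]; offsets[y][x] is a (dy, dx) pair from
    the pixel to the predicted keypoint location.
    """
    h, w = len(heatmap), len(heatmap[0])
    votes = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = offsets[y][x]
            ty, tx = round(y + dy), round(x + dx)
            if 0 <= ty < h and 0 <= tx < w:  # drop votes that leave the grid
                votes[ty][tx] += heatmap[y][x]
    return votes


def argmax2d(grid):
    """Return the (row, col) of the largest entry."""
    cells = ((y, x) for y in range(len(grid)) for x in range(len(grid[0])))
    return max(cells, key=lambda p: grid[p[0]][p[1]])


heatmap = [[0.1, 0.8, 0.0],
           [0.0, 0.9, 0.7],
           [0.0, 0.6, 0.0]]
# Every pixel points at (1, 1) except the top-left one, which votes for itself.
offsets = [[(1 - y, 1 - x) for x in range(3)] for y in range(3)]
offsets[0][0] = (0, 0)
votes = aggregate_votes(heatmap, offsets)
print(argmax2d(votes))  # (1, 1)
```

Pooling many weak, consistent votes at one location yields a sharper peak than the raw heatmap alone, which is the intuition behind combining the two outputs.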
Untangling Local and Global Deformations in Deep Convolutional Networks for Image Classification and Sliding Window Detection
Deep Convolutional Neural Networks (DCNNs) commonly use generic `max-pooling'
(MP) layers to extract deformation-invariant features, but we argue in favor of
a more refined treatment. First, we introduce epitomic convolution as a
building block alternative to the common convolution-MP cascade of DCNNs; while
having identical complexity to MP, Epitomic Convolution allows for parameter
sharing across different filters, resulting in faster convergence and better
generalization. Second, we introduce a Multiple Instance Learning approach to
explicitly accommodate global translation and scaling when training a DCNN
exclusively with class labels. For this we rely on a `patchwork' data structure
that efficiently lays out all image scales and positions as candidates to a
DCNN. Factoring global and local deformations allows a DCNN to `focus its
resources' on the treatment of non-rigid deformations and yields a substantial
classification accuracy improvement. Third, further pursuing this idea, we
develop an efficient DCNN sliding window object detector that employs explicit
search over position, scale, and aspect ratio. We provide competitive image
classification and localization results on the ImageNet dataset and object
detection results on the Pascal VOC 2007 benchmark.
Comment: 13 pages, 7 figures, 5 tables. arXiv admin note: substantial text
overlap with arXiv:1406.273